Introduction

This analysis examines simulated income data for Hungary, focusing on the relationships between income and various demographic factors including age, location, occupation, and gender. The dataset is simulated to reflect realistic patterns while maintaining a manageable size for analysis.

Data Simulation

The data was simulated by the data_simulation.R script. The data is available in the hungarian_income_data.csv file.

Important things to note about the generated data:

  • Only the 8 most populated cities of Hungary are taken into count weighted by their population. List of cities: Budapest, Debrecen, Szeged, Miskolc, Pécs, GyÅ‘r, Szombathely, Eger.

  • Only the 10 most common occupations are taken into count weighted by their frequency in the workforce. List of occupations: Software Developer, Teacher, Doctor, Sales Representative, Engineer, Accountant, Nurse, Manager, Chef, Driver.

  • The age distribution is generated by a beta distribution with parameters \(\alpha = 2\) and \(\beta = 3\) and multiplied by 95 to put the end result in the desired range. The beta distribution with the aforementioned parameters skews the age distribution towards younger ages, which is more realistic.

  • There are three groups of people categorized by their age:

    • Underage: each person has a random age at which they start working between 14 and 24.

    • Working age: 19-67

    • Pension age: each person has a random retirement age between 60 and 75.

  • Under 18 people have no income.

  • Working age people have a regular income based on their age, occupation, city, and gender.

  • Pension age people have a pension based on their occupation and city.

  • All working age people are considered to be employed.

  • The income of a working age man is 20.000 HUF higher than the income of a working age woman in the same occupation, city, and age group.

Descriptive Statistics

summary_stats <- summary(data)
kable(summary_stats, caption = "Summary Statistics of the Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Summary Statistics of the Dataset
age city occupation gender income starting_age retirement_age
Min. : 0.00 Length:10000 Length:10000 Length:10000 Min. : 0 Min. :14.00 Min. :60.00
1st Qu.:23.00 Class :character Class :character Class :character 1st Qu.: 74373 1st Qu.:18.00 1st Qu.:65.00
Median :36.00 Mode :character Mode :character Mode :character Median :320922 Median :19.00 Median :67.00
Mean :37.87 NA NA NA Mean :296226 Mean :18.99 Mean :66.99
3rd Qu.:51.00 NA NA NA 3rd Qu.:469238 3rd Qu.:20.00 3rd Qu.:69.00
Max. :94.00 NA NA NA Max. :856706 Max. :24.00 Max. :75.00
cat("\nCity Distribution:\n")
## 
## City Distribution:
print(table(data$city))
## 
##    Budapest    Debrecen        Eger        Győr     Miskolc        Pécs 
##        3468        1543         658         806         974         853 
##      Szeged Szombathely 
##        1160         538
cat("\nOccupation Distribution:\n")
## 
## Occupation Distribution:
print(table(data$occupation))
## 
##           Accountant                 Chef               Doctor 
##                 1190                  507                  492 
##               Driver             Engineer              Manager 
##                  533                  980                  775 
##                Nurse Sales Representative   Software Developer 
##                 1209                 2063                  772 
##              Teacher 
##                 1479
cat("\nAge Distribution Summary:\n")
## 
## Age Distribution Summary:
print(summary(data$age))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   23.00   36.00   37.87   51.00   94.00
cat("\nRetirement Age Distribution:\n")
## 
## Retirement Age Distribution:
print(table(data$retirement_age))
## 
##   60   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75 
##  139  201  347  549  829 1079 1187 1339 1247 1044  841  526  346  177   79   70
cat("\nAge Group Distribution:\n")
## 
## Age Group Distribution:
print(table(cut(data$age, breaks = c(0, 15, 65, Inf), labels = c("0-14", "15-64", "65+"))))
## 
##  0-14 15-64   65+ 
##  1305  7774   920
cat("\nIncome Distribution Summary:\n")
## 
## Income Distribution Summary:
print(summary(data$income))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   74373  320922  296226  469238  856706
cat("\nIncome Distribution by Age Group:\n")
## 
## Income Distribution by Age Group:
print(tapply(data$income, cut(data$age, breaks = c(0, 18, 65, Inf), labels = c("Under 18", "Working Age", "Pension Age")), summary)) 
## $`Under 18`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    9084       0  374737 
## 
## $`Working Age`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0  275889  394134  385521  500014  856706 
## 
## $`Pension Age`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0   38531   59379  145714  104985  820253

Demographics Analysis

library(forcats)

data <- data %>%
  mutate(age_group = cut(age, 
                        breaks = seq(0, 100, by = 2), 
                        right = FALSE, 
                        include.lowest = TRUE, 
                        labels = seq(0, 98, by = 2)))

dem_pyramid <- data %>%
  group_by(age_group, gender) %>%
  summarise(count = n(), .groups = 'drop') %>%
  mutate(count = ifelse(gender == "Male", -count, count))

ggplot(dem_pyramid, aes(x = age_group, y = count, fill = gender)) +
  geom_bar(stat = "identity", width = 0.8, color = "black") +
  scale_y_continuous(labels = abs, expand = expansion(mult = c(0.05, 0.05))) +
  scale_fill_manual(values = c("Male" = "#00BFFF", "Female" = "#FF3B3B")) +
  coord_flip() +
  labs(title = "Population Pyramid of Simulated Hungarian Data",
       x = "Age Group",
       y = "Count",
       fill = "Gender") +
  custom_theme +
  theme(legend.position = "top",
        axis.text.y = element_text(size = 10, face = "bold"),
        plot.margin = margin(t = 20, r = 20, b = 20, l = 20))

ggplot(data %>% filter(age >= 18), aes(x = income, fill = gender)) +
  geom_density(alpha = 0.6) +
  scale_fill_viridis_d() +
  labs(title = "Income Distribution by Gender",
       subtitle = "(working age only)",
       x = "Income (HUF)",
       y = "Density") +
  custom_theme

ggplot(data, aes(x = age, y = income, color = gender)) +
  geom_point(alpha = 0.1, width = 0.2) +
  scale_color_viridis_d() +
  labs(title = "Income Distribution by Age",
       subtitle = "(working age only)",
       x = "Age",
       y = "Income (HUF)") +
  custom_theme

ggplot(data %>% filter(age >= 18), aes(x = reorder(occupation, income, FUN = median), y = income, color = occupation)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(alpha = 0.1, width = 0.2) +
  scale_color_viridis_d() +
  coord_flip() +
  labs(title = "Income Distribution by Occupation",
       subtitle = "(working age only)",
       x = "Occupation",
       y = "Income (HUF)") +
  custom_theme

ggplot(data %>% filter(age >= 18), aes(x = reorder(city, income, FUN = median), y = income, fill = city)) +
  geom_violin(alpha = 0.7) +
  geom_boxplot(width = 0.2, alpha = 0.5) +
  scale_fill_viridis_d() +
  coord_flip() +
  labs(title = "Income Distribution by City",
       subtitle = "(working age only)",
       x = "City",
       y = "Income (HUF)") +
  custom_theme

ggplot(data %>% filter(age >= 18), aes(x = age, y = income, color = gender)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", se = TRUE) +
  scale_color_viridis_d() +
  labs(title = "Relationship between Age and Income",
       subtitle = "(working age only)",
       x = "Age",
       y = "Income (HUF)") +
  custom_theme

income_by_category <- data %>%
  filter(age >= 18) %>%
  group_by(occupation, city, gender) %>%
  summarise(
    mean_income = mean(income),
    count = n(),
    .groups = 'drop'
  ) %>%
  arrange(desc(mean_income))

# heatmap
ggplot(income_by_category, aes(x = city, y = occupation, fill = mean_income)) +
  geom_tile() +
  scale_fill_viridis(name = "Mean Income (HUF)") +
  facet_wrap(~gender) +
  labs(title = "Mean Income by Occupation, City, and Gender",
       subtitle = "(working age only)",
       x = "City",
       y = "Occupation") +
  custom_theme +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        strip.text = element_text(face = "bold"))

top_earners <- income_by_category %>%
  arrange(desc(mean_income)) %>%
  head(10)

kable(top_earners, 
      caption = "Top 10 Highest Earning Combinations",
      digits = 0) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), 
                full_width = FALSE)
Top 10 Highest Earning Combinations
occupation city gender mean_income count
Software Developer Budapest Female 555020 104
Software Developer Budapest Male 499488 113
Software Developer Debrecen Male 494909 45
Doctor Budapest Male 486931 70
Software Developer Szeged Female 481669 46
Manager Szeged Male 478305 32
Manager Budapest Male 467616 122
Software Developer Szeged Male 447791 39
Doctor Budapest Female 443706 91
Software Developer Miskolc Female 443251 28

Hypothesis Testing

Parametric Tests

1. Gender Income Difference (t-test)

# Test if there's a significant difference in income between genders
t_test_result <- t.test(income ~ gender, data = data)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  income by gender
## t = -3.7426, df = 9996.3, p-value = 0.0001832
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -24073.503  -7524.036
## sample estimates:
## mean in group Female   mean in group Male 
##             288268.7             304067.4

2. City Income Differences (ANOVA)

# Test if there are significant differences in income between cities
city_anova <- aov(income ~ city, data = data)
print(summary(city_anova))
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## city           7 1.182e+13 1.688e+12   38.83 <2e-16 ***
## Residuals   9992 4.344e+14 4.347e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3. Occupation Income Differences (ANOVA)

# Test if there are significant differences in income between occupations
occupation_anova <- aov(income ~ occupation, data = data)
print(summary(occupation_anova))
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## occupation     9 1.529e+13 1.698e+12   39.38 <2e-16 ***
## Residuals   9990 4.309e+14 4.313e+10                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Non-parametric Tests

1. Gender Income Distribution (Kolmogorov-Smirnov Test)

# Test if income distributions differ between genders
male_income <- data$income[data$gender == "Male"]
female_income <- data$income[data$gender == "Female"]
ks_test <- ks.test(male_income, female_income)
print(ks_test)
## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  male_income and female_income
## D = 0.046334, p-value = 4.361e-05
## alternative hypothesis: two-sided

2. Income Distribution by City (Kruskal-Wallis Test)

# Test if income distributions differ between cities
kruskal_test <- kruskal.test(income ~ city, data = data)
print(kruskal_test)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  income by city
## Kruskal-Wallis chi-squared = 263.84, df = 7, p-value < 2.2e-16

Regression Analysis

Multiple Linear Regression

# Convert categorical variables to factors
data_working_age <- data %>% filter(age >= 18)
data_working_age$city <- as.factor(data_working_age$city)
data_working_age$occupation <- as.factor(data_working_age$occupation)
data_working_age$gender <- as.factor(data_working_age$gender)

# Fit the model
model1 <- lm(income ~ age + I(age^2) + city + occupation + gender, data = data_working_age)

# Create a beautiful model summary table
model_summary <- summary(model1)
kable(tidy(model_summary), caption = "Multiple Linear Regression Results (Working Age Only)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Multiple Linear Regression Results (Working Age Only)
term estimate std.error statistic p.value
(Intercept) -564005.7039 8854.984102 -63.693587 0.0000000
age 43984.4061 367.359676 119.731176 0.0000000
I(age^2) -440.9603 3.838656 -114.873631 0.0000000
cityDebrecen -45773.0706 3463.020593 -13.217672 0.0000000
cityEger -93940.7010 4782.185556 -19.643885 0.0000000
cityGyőr -89210.9797 4445.406543 -20.068126 0.0000000
cityMiskolc -96508.6864 4148.656395 -23.262637 0.0000000
cityPécs -89476.1672 4352.994886 -20.555082 0.0000000
citySzeged -41476.6860 3898.003624 -10.640495 0.0000000
citySzombathely -90903.1668 5287.112689 -17.193348 0.0000000
occupationChef -27765.9163 6017.900655 -4.613887 0.0000040
occupationDoctor 59233.3095 6080.352617 9.741756 0.0000000
occupationDriver -45601.3129 5994.714141 -7.606920 0.0000000
occupationEngineer 22666.2477 4893.333483 4.632067 0.0000037
occupationManager 44978.3699 5268.892985 8.536588 0.0000000
occupationNurse -23498.2703 4629.489927 -5.075780 0.0000004
occupationSales Representative -52814.4425 4119.434346 -12.820800 0.0000000
occupationSoftware Developer 93523.3378 5242.671238 17.838871 0.0000000
occupationTeacher -13507.9435 4414.906643 -3.059621 0.0022232
genderMale 18514.3476 2276.242924 8.133731 0.0000000
# Enhanced model diagnostics
par(mfrow = c(2,2))
plot(model1, col = income_palette[1], pch = 19, cex = 0.7)

Polynomial Regression for Age-Income Relationship

# Fit polynomial regression
model2 <- lm(income ~ poly(age, 3), data = data_working_age)

# Create a static plot
ggplot(data_working_age, aes(x = age, y = income)) +
  geom_point(alpha = 0.1, color = income_palette[1]) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), 
              color = income_palette[5], fill = income_palette[5], alpha = 0.2) +
  labs(title = "Polynomial Regression: Age vs Income",
       subtitle = "Cubic polynomial fit with confidence interval (Working Age Only)",
       x = "Age",
       y = "Income (HUF)") +
  custom_theme

# Model comparison
model_comparison <- data.frame(
  Model = c("Multiple Linear", "Polynomial"),
  R_squared = c(summary(model1)$r.squared, summary(model2)$r.squared),
  Adj_R_squared = c(summary(model1)$adj.r.squared, summary(model2)$adj.r.squared)
)

kable(model_comparison, caption = "Model Comparison (Working Age Only)") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Model Comparison (Working Age Only)
Model R_squared Adj_R_squared
Multiple Linear 0.6765319 0.6757960
Polynomial 0.5780235 0.5778722

Predictions

# Create a sample prediction
new_data <- data.frame(
  age = 35,
  city = "Budapest",
  occupation = "Software Developer",
  gender = "Male"
)

# Predict income
prediction <- predict(model1, newdata = new_data, interval = "prediction")
print(prediction)
##        fit      lwr      upr
## 1 547309.8 343154.5 751465.1

Summary and Conclusions

The analysis of the simulated Hungarian income data reveals several interesting patterns:

  1. There is a significant gender pay gap, with males earning more on average than females.
  2. Income varies significantly across different cities, with Budapest showing the highest average income.
  3. Occupation has a strong effect on income, with software developers and doctors earning the most.
  4. Age shows a non-linear relationship with income, peaking in the middle age range.
  5. The regression models explain a significant portion of the income variation.

The analysis demonstrates the complex interplay between various factors affecting income levels in Hungary. The findings suggest that both demographic factors (age, gender) and professional factors (occupation, location) play important roles in determining income levels.